METR

AI companies and wider society want to understand the capabilities of frontier AI systems, and what risks they pose.

METR is a nonprofit research organization which studies AI capabilities, including broad autonomous capabilities and the ability of AI systems to conduct AI R&D.

Top resources

GPT-5 Evaluation Results

We evaluate whether GPT-5 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capabilities continue to advance rapidly, and models display increasing eval awareness.

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has increased exponentially and consistently over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under five years, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
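
The arithmetic behind an extrapolation of this kind can be sketched in a few lines of Python. This is only an illustration of what a fixed doubling time implies, not the method used in the paper: the 7-month doubling time comes from the summary above, while the function names, the starting task length, and the target task length are hypothetical placeholders chosen for the example.

```python
import math

# From the summary above: "a doubling time of around 7 months".
DOUBLING_TIME_MONTHS = 7.0


def extrapolated_task_length(current_hours: float, months_ahead: float,
                             doubling_time_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Task length after `months_ahead`, assuming clean exponential growth."""
    return current_hours * 2 ** (months_ahead / doubling_time_months)


def months_until(current_hours: float, target_hours: float,
                 doubling_time_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Months needed to grow from `current_hours` to `target_hours`."""
    return doubling_time_months * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    # Hypothetical inputs, NOT figures from the paper: a 1-hour task length
    # today and a 40-hour (one working week) target.
    start_hours = 1.0
    target_hours = 40.0

    months = months_until(start_hours, target_hours)
    print(f"Months to reach {target_hours:.0f} h from {start_hours:.0f} h: {months:.1f}")

    # Number of doublings implied by six years at a 7-month doubling time.
    print(f"Doublings in 72 months: {72 / DOUBLING_TIME_MONTHS:.1f}")
```

Because growth is exponential, the answer depends only logarithmically on the assumed starting point: halving the starting task length adds just one doubling time (about 7 months) to the estimate.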

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer to complete tasks than when they work without them. In this setting, AI makes them slower.

RE-Bench — benchmark and paper tracking automation of AI R&D

Measuring the performance of humans and AI agents on day-long ML research engineering tasks

Measuring autonomous AI capabilities — resource collection

An index of our research and guidance on how to measure AI systems’ ability to autonomously complete a wide range of multi-hour tasks

Frontier AI safety policies — index and resources

A list of AI companies’ frontier safety policies intended to evaluate and manage severe AI risks

Evaluation reports

We have worked with companies such as Anthropic and OpenAI to conduct preliminary evaluations of the autonomous capabilities of several frontier AI models. We do this both to understand the capabilities of frontier models and to pilot third-party evaluator arrangements. (We do not accept compensation for this work.) We also occasionally evaluate models independently after they are released, without involvement from the model’s developers. Recent public reports resulting from this work are below, with additional discussion in the respective system cards.

GPT-5

DeepSeek and Qwen

OpenAI o3 and o4-mini

Claude 3.5 Sonnet (New)

DeepSeek-V3

OpenAI o1-preview and o1-mini

Claude 3.5 Sonnet (June 2024 version)

Partnerships

Beyond the evaluations of new models discussed above, companies such as OpenAI and Anthropic have provided access and compute credits to support evaluation research.

We think it’s important for there to be third-party evaluators with formal arrangements and access commitments — both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations. We do not yet have such arrangements, but we are excited about taking more steps in this direction.

We are also partnering with the AI Security Institute and are part of the NIST AI Safety Institute Consortium.

Recent updates

28 October 2025

Review of the Anthropic Summer 2025 Pilot Sabotage Risk Report

External review from METR of Anthropic's Summer 2025 Sabotage Risk Report

23 October 2025

Summary of our gpt-oss methodology review

Details on external recommendations from METR for gpt-oss Preparedness experiments and follow-up from OpenAI.

14 October 2025

MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity

MALT (Manually-reviewed Agentic Labeled Transcripts) is a dataset of natural and prompted examples of behaviors that threaten evaluation integrity (like generalized reward hacking or sandbagging).

03 October 2025

Early Results on Monitorability in QA Settings

Research on how AI agents can hide secondary task-solving from monitors, finding that harder tasks are more detectable and small models can learn to evade larger monitors.

22 August 2025

Claude, GPT, and Gemini All Struggle to Evade Monitors

A replication of a Google DeepMind paper on chain-of-thought monitoring, showing evidence that monitoring works on other companies' models.

20 August 2025

Forecasting the Impacts of AI R&D Acceleration: Results of a Pilot Study

AI agents are improving rapidly at autonomous software development and machine learning tasks, and, if recent trends hold, may match human researchers at challenging months-long research projects in under a decade. Some economic models predict that automation of AI research by AI agents could increase the pace of further progress dramatically, with many years of progress at the current rate being compressed into months. AI developers have identified this as a key capability to monitor and prepare for, since the national security implications and societal impacts of such rapid progress could be enormous.